Tasks

I. Perform an initial inspection of the cab ride dataset and do any data cleaning necessary to make your analysis more successful.

II. What are the most popular pick up/drop off areas in NY? What are the most popular journeys?

III. Identify any outliers in the daily volumes of cab rides. Can you provide a reason for the largest outlier?

Loading Libraries

# Data reading
library(readr)

# Data manipulation
library(dplyr) 
library(data.table)
library(tibble)
library(tidyr) 
library(stringr)
library(forcats)

# Data Visualisation
library(ggplot2)
library(scales)
library(RColorBrewer) 

# Date & Time
library(lubridate)

# Geospatial locations & Maps
library(geosphere) 
library(leaflet)
library(leaflet.extras)
library(maps)
library(htmltools)

# Weather
library(weathermetrics)

I. Data inspection and cleaning

I.1 Structure and overview

cabs <- read_csv("cab_data.csv") # read the data

Let’s have a first overview of the data structure and the variables.

head(cabs)
## # A tibble: 6 x 11
##   id    vendor_id pickup_datetime     dropoff_datetime    passenger_count
##   <chr>     <int> <dttm>              <dttm>                        <int>
## 1 id28~         2 2016-03-14 17:24:55 2016-03-14 17:32:30               1
## 2 id23~         1 2016-06-12 00:43:35 2016-06-12 00:54:38               1
## 3 id38~         2 2016-01-19 11:35:24 2016-01-19 12:10:48               1
## 4 id35~         2 2016-04-06 19:32:31 2016-04-06 19:39:40               1
## 5 id21~         2 2016-03-26 13:30:55 2016-03-26 13:38:10               1
## 6 id08~         2 2016-01-30 22:01:40 2016-01-30 22:09:03               6
## # ... with 6 more variables: pickup_longitude <dbl>,
## #   pickup_latitude <dbl>, dropoff_longitude <dbl>,
## #   dropoff_latitude <dbl>, store_and_fwd_flag <chr>, trip_duration <int>

Data dictionary & description

This dataset contains a sample of yellow cab rides taken in NY from the 1st of January 2016 to the 30th of June 2016 (6 months).

Data fields

id - a unique identifier for each journey

vendor_id - a code indicating the provider associated with the trip record

pickup_datetime - date and time when the journey started

dropoff_datetime - date and time when the journey completed

passenger_count - the number of passengers in the vehicle (driver entered value)

pickup_longitude - the longitude where the journey started

pickup_latitude - the latitude where the journey started

dropoff_longitude - the longitude where the journey completed

dropoff_latitude - the latitude where the journey completed

store_and_fwd_flag - indicates whether the trip record was held in vehicle memory before being sent to the vendor because the vehicle had no connection to the server - Y = store and forward; N = not a store-and-forward trip

trip_duration - duration of the trip in seconds

Let’s check some key figures and stats of the data using summary and glimpse.

summary(cabs)
##       id              vendor_id     pickup_datetime              
##  Length:1458644     Min.   :1.000   Min.   :2016-01-01 00:00:17  
##  Class :character   1st Qu.:1.000   1st Qu.:2016-02-17 16:46:04  
##  Mode  :character   Median :2.000   Median :2016-04-01 17:19:40  
##                     Mean   :1.535   Mean   :2016-04-01 10:10:24  
##                     3rd Qu.:2.000   3rd Qu.:2016-05-15 03:56:08  
##                     Max.   :2.000   Max.   :2016-06-30 23:59:39  
##  dropoff_datetime              passenger_count pickup_longitude 
##  Min.   :2016-01-01 00:03:31   Min.   :0.000   Min.   :-121.93  
##  1st Qu.:2016-02-17 17:05:32   1st Qu.:1.000   1st Qu.: -73.99  
##  Median :2016-04-01 17:35:12   Median :1.000   Median : -73.98  
##  Mean   :2016-04-01 10:26:24   Mean   :1.665   Mean   : -73.97  
##  3rd Qu.:2016-05-15 04:10:51   3rd Qu.:2.000   3rd Qu.: -73.97  
##  Max.   :2016-07-01 23:02:03   Max.   :9.000   Max.   : -61.34  
##  pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag
##  Min.   :34.36   Min.   :-121.93   Min.   :32.18    Length:1458644    
##  1st Qu.:40.74   1st Qu.: -73.99   1st Qu.:40.74    Class :character  
##  Median :40.75   Median : -73.98   Median :40.75    Mode  :character  
##  Mean   :40.75   Mean   : -73.97   Mean   :40.75                      
##  3rd Qu.:40.77   3rd Qu.: -73.96   3rd Qu.:40.77                      
##  Max.   :51.88   Max.   : -61.34   Max.   :43.92                      
##  trip_duration    
##  Min.   :      1  
##  1st Qu.:    397  
##  Median :    662  
##  Mean   :    959  
##  3rd Qu.:   1075  
##  Max.   :3526282
glimpse(cabs)
## Observations: 1,458,644
## Variables: 11
## $ id                 <chr> "id2875421", "id2377394", "id3858529", "id3...
## $ vendor_id          <int> 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2...
## $ pickup_datetime    <dttm> 2016-03-14 17:24:55, 2016-06-12 00:43:35, ...
## $ dropoff_datetime   <dttm> 2016-03-14 17:32:30, 2016-06-12 00:54:38, ...
## $ passenger_count    <int> 1, 1, 1, 1, 1, 6, 4, 1, 1, 1, 1, 4, 2, 1, 1...
## $ pickup_longitude   <dbl> -73.98215, -73.98042, -73.97903, -74.01004,...
## $ pickup_latitude    <dbl> 40.76794, 40.73856, 40.76394, 40.71997, 40....
## $ dropoff_longitude  <dbl> -73.96463, -73.99948, -74.00533, -74.01227,...
## $ dropoff_latitude   <dbl> 40.76560, 40.73115, 40.71009, 40.70672, 40....
## $ store_and_fwd_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"...
## $ trip_duration      <int> 455, 663, 2124, 429, 435, 443, 341, 1551, 2...

We see that:

The dataset has a total of 1,458,644 observations and 11 variables.

vendor_id has two values (1 or 2), probably two different taxi companies operating in NYC.

pickup_datetime and dropoff_datetime combine date and time in a single timestamp (to reshape later on).

passenger_count: the median indicates that at least 50% of journeys carried only one passenger and at least 75% carried at most 2. The maximum is 9 passengers, which is a lot of people for one car! We’ll check these values in the cleaning part.

longitude & latitude for both pickup and dropoff have extreme values that are very unlikely to be in NYC - to identify during the cleaning part as well!

trip_duration: some very quick trips (1 s) and extremely long ones (~41 days) - to investigate in the cleaning part! The 3rd quartile also shows that at least 75% of trips lasted less than about 18 minutes.

I.2 Missing values or duplicate journeys

# Check potential missing values
sum(is.na(cabs))
## [1] 0

We are lucky! The data is complete without any missing values.

# Check that there are no duplicate journeys via id
length(unique(cabs$id))
## [1] 1458644

No duplicates either.

I.3 Reformatting features

We will reformat the pickup and dropoff times as date objects. We also convert vendor_id and passenger_count to factors to make it easier to visualise relationships involving these features.

cabs <- cabs %>% mutate(vendor_id = factor(vendor_id), # as factor
         passenger_count = factor(passenger_count), # as factor as well
        pickup_datetime = ymd_hms(pickup_datetime), # date object
         dropoff_datetime = ymd_hms(dropoff_datetime)) # date object

I.4 Finding outliers & cleaning

I.4.a Longitude & Latitude

As our analysis focuses on NYC, we will get rid of all the trips that were not in the NYC area. To do so, we will remove them using the bounding box of New York (found here: https://www.flickr.com/places/info/2459115).

# Keeping the trips in the NYC area
# NYC Bounding Box: -74.2589, 40.4774, -73.7004, 40.9176
cabs = cabs %>% filter(pickup_longitude >= -74.2589 & pickup_longitude <= -73.7004 & 
                       dropoff_longitude >= -74.2589 & dropoff_longitude <= -73.7004 & 
                       pickup_latitude <= 40.9176 & pickup_latitude >= 40.4774 & 
                       dropoff_latitude <= 40.9176 & dropoff_latitude >= 40.4774)
nyc_map = leaflet() %>%
    addTiles() %>%
    addRectangles(lat1 = 40.9176, lng1 = -74.2589, lat2 = 40.4774, lng2 = -73.7004)
nyc_map

The blue rectangle above shows the zone we kept for our analysis.

I.4.b Trips duration

We saw in the summary that some trip durations are extremely short or long. Let’s investigate with visualizations.

cabs %>%
  ggplot(aes(trip_duration)) +
  geom_histogram(fill = "blue", bins = 200) +
  scale_x_log10() +
  scale_y_sqrt()

Note: the logarithmic x-axis and square-root y-axis allow us to visualize the distribution properly.

We see that:

  • Strange short rides lasting less than 10 seconds.

  • A strange small peak of very long trip durations, plus extreme outliers.

  • Still, rides follow a rather smooth distribution.

Very long trips investigation

Let’s have a closer look at the extremely long trips:

cabs %>%
  arrange(desc(trip_duration)) %>% head(10)
## # A tibble: 10 x 11
##    id    vendor_id pickup_datetime     dropoff_datetime    passenger_count
##    <chr> <fct>     <dttm>              <dttm>              <fct>          
##  1 id00~ 1         2016-02-13 22:46:52 2016-03-25 18:18:14 1              
##  2 id13~ 1         2016-01-05 06:14:15 2016-01-31 01:01:07 1              
##  3 id03~ 1         2016-02-13 22:38:00 2016-03-08 15:57:38 2              
##  4 id18~ 1         2016-01-05 00:19:42 2016-01-27 11:08:38 1              
##  5 id19~ 2         2016-02-15 23:18:06 2016-02-16 23:17:58 2              
##  6 id05~ 2         2016-05-31 13:00:39 2016-06-01 13:00:30 1              
##  7 id09~ 2         2016-05-06 00:00:10 2016-05-07 00:00:00 1              
##  8 id28~ 2         2016-06-30 16:37:52 2016-07-01 16:37:39 1              
##  9 id13~ 2         2016-06-23 16:01:45 2016-06-24 16:01:30 1              
## 10 id25~ 2         2016-05-17 22:22:56 2016-05-18 22:22:35 4              
## # ... with 6 more variables: pickup_longitude <dbl>,
## #   pickup_latitude <dbl>, dropoff_longitude <dbl>,
## #   dropoff_latitude <dbl>, store_and_fwd_flag <chr>, trip_duration <int>

There are 4 trips that lasted 22 days or more (up to 41), and numerous trips that lasted around 24 hours! Some of them have strange pickup and dropoff times, both falling at almost exactly 00:00:00 (id0953667). I did not think traffic jams were that bad in NYC!

Let’s put on a map the trips which lasted more than 24 hours :

xtrm_long_trips = cabs %>% filter(trip_duration>24*3600) # trips lasting more than 24 hours

xtrm_map = leaflet(data = xtrm_long_trips) %>% addProviderTiles(providers$CartoDB.Positron) %>%
  addCircleMarkers(~ pickup_longitude, ~pickup_latitude, radius = 4,
                   color = "green", fillOpacity = 0.3,
                   label =~as.character(paste(id,";",
                                              "duration (min):",round(trip_duration/60,0)))) %>%
  addCircleMarkers(~ dropoff_longitude, ~dropoff_latitude, radius = 4,
                   color = "orange", fillOpacity = 0.3,
                   label =~as.character(paste(id,";",
                                              "duration (min):",round(trip_duration/60,0))))
for(i in 1:nrow(xtrm_long_trips)){
    xtrm_map <- addPolylines(xtrm_map, lat = as.numeric(xtrm_long_trips[i, c(7, 9)]), 
                               lng = as.numeric(xtrm_long_trips[i, c(6, 8)]))
} # adding the connection between each pickup and dropoff
xtrm_map # map with green points as pickup location, orange points as dropoff and the link between them 

These seem to be normal trips, with 2 of them from JFK Airport and the others in or close to Manhattan. Still, it’s very strange to stay that long in a taxi for rather short trips, even with good music…

As they are very likely real trips that suffered from a technical issue, we will assign them the mean duration as trip duration.

cabs = cabs %>%
    mutate(trip_duration = replace(trip_duration, trip_duration > 24*3600, mean(trip_duration))) # assign these trips the mean

We also saw a lot of trips that lasted almost 24 hours. Let’s do the same as above to identify them.

day_long_trips = cabs %>% filter(trip_duration > 20*3600) # trips lasting more than 20 hours

day_map = leaflet(data = day_long_trips) %>% addProviderTiles(providers$CartoDB.Positron) %>%
  addCircleMarkers(~ pickup_longitude, ~pickup_latitude, radius = 1,
                   color = "green", fillOpacity = 0.3) %>%
  addCircleMarkers(~ dropoff_longitude, ~dropoff_latitude, radius = 1,
                   color = "orange", fillOpacity = 0.3)
for(i in 1:nrow(day_long_trips)){
    day_map <- addPolylines(day_map, lat = as.numeric(day_long_trips[i, c(7, 9)]), 
                               lng = as.numeric(day_long_trips[i, c(6, 8)]),
                            color = "blue", weight = 2,opacity = 0.1)
} # adding the connection between each pickup and dropoff
day_map

The majority of the trips are within Manhattan or the NYC area. Many of them run between Manhattan and the airports (Newark Liberty in the south-west, LaGuardia in the east and JFK in the south-east). We can assume these trips were real but suffered a technical issue, like the ones that lasted several days.

We will assign them the mean value as well.

cabs = cabs %>%
    mutate(trip_duration = replace(trip_duration, trip_duration > 20*3600, mean(trip_duration))) # assign these trips the mean

We saw in the summary that at least 75% of trips lasted less than about 18 minutes. Nevertheless, some trips lasted more than 3, 5 or even 10+ hours. Let’s have a look at the trips that lasted more than 3 hours.

long_trips = cabs %>% filter(trip_duration > 3*3600) # trips lasting more than 3 hours

long_map = leaflet(data = long_trips) %>% addProviderTiles(providers$CartoDB.Positron) %>%
  addCircleMarkers(~ pickup_longitude, ~pickup_latitude, radius = 1,
                   color = "green", fillOpacity = 0.3, popup ="pickup")%>%
  addCircleMarkers(~ dropoff_longitude, ~dropoff_latitude, radius = 1,
                   color = "orange", fillOpacity = 0.3, popup ="dropoff")
for(i in 1:nrow(long_trips)){
    long_map <- addPolylines(long_map, lat = as.numeric(long_trips[i, c(7, 9)]), 
                               lng = as.numeric(long_trips[i, c(6, 8)]),
                            color = "blue", weight = 2,opacity = 0.1)
} # adding the connection betweek each pickup and dropoff
long_map

Same as the previous ones: most of them are journeys to or from JFK or LaGuardia. These are real trips, but the duration values probably suffer from a technical error.

It’s very unlikely that people spend more than 3 hours in a taxi for these kinds of trips. Even for the longer ones (Manhattan - JFK) the duration is around 35 min when traffic is fluid according to Google Maps, rising to 2+ hours in a traffic jam. Thus, we will keep the durations of trips under 3 hours and assign the mean to those above 3 hours.

cabs = cabs %>%
    mutate(trip_duration = replace(trip_duration, trip_duration > 3*3600, mean(trip_duration))) # assign these trips the mean
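A side note on the imputation above: `mean(trip_duration)` is computed before the replacement, so the outliers themselves inflate the value being imputed, and the mean is sensitive to extremes in general. A more robust variant (a sketch on toy data, not part of the pipeline above) would impute the median of the plausible trips instead:

```r
library(dplyr)

# Toy durations in seconds: four plausible trips and one extreme outlier
toy <- tibble(trip_duration = c(400, 660, 900, 1200, 15 * 3600))

cap <- 3 * 3600  # same 3-hour threshold as above
med <- median(toy$trip_duration[toy$trip_duration <= cap])  # median of plausible trips only

toy <- toy %>%
  mutate(trip_duration = replace(trip_duration, trip_duration > cap, med))

toy$trip_duration  # 400 660 900 1200 780
```

For this dataset the difference is small, but computing the statistic on the kept trips only avoids the outliers polluting their own replacement value.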

Very short trips investigation

There were also very short trips. Let’s dig into them by taking the ones that lasted less than 2 minutes:

short_trips <- cabs %>% filter(trip_duration < 2*60)
short_trips <- sample_n(short_trips, 2000) # sample of around 10% for visualisation and loading purpose
short_map = leaflet(data = short_trips) %>% addProviderTiles(providers$CartoDB.Positron) %>%
  addCircleMarkers(~ pickup_longitude, ~pickup_latitude, radius = 1,
                   color = "green", fillOpacity = 0.3)%>%
  addCircleMarkers(~ dropoff_longitude, ~dropoff_latitude, radius = 1,
                   color = "orange", fillOpacity = 0.3)
for(i in 1:nrow(short_trips)){
    short_map <- addPolylines(short_map, lat = as.numeric(short_trips[i, c(7, 9)]), 
                               lng = as.numeric(short_trips[i, c(6, 8)]),
                            color = "blue", weight = 2,opacity = 0.1)
} # adding the connection between each pickup and dropoff
short_map

Very short trips may actually happen; we indeed see connections between numerous points within Manhattan. However, a lot of them seem to have the same location for pickup and dropoff, which can actually happen with an immediate cancellation!

We will add a new variable, the distance, and use it in the next part of our inspection to deal with the trips whose distance is 0 (or very close to it).

I.4.c Trip distance

From the coordinates of the pickup and dropoff points we can calculate the direct distance between the two points. These values correspond to the minimum possible travel distance.

The distCosine function from the geosphere package gives the shortest distance between two points along a great circle. This method assumes a spherical earth, ignoring ellipsoidal effects. The distance is in meters by default.

pickups = cabs %>% select(pickup_longitude,pickup_latitude) # isolate pickups long and lat
dropoffs =  cabs %>% select(dropoff_longitude,dropoff_latitude) # isolate dropoffs long and lat

cabs$distance <- distCosine(pickups, dropoffs) # add it to the data
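To make the spherical-earth assumption concrete, here is a hand-rolled haversine in base R (a sketch for illustration; geosphere’s distCosine and distHaversine agree with it to within numerical noise at city scale). The coordinates below are approximate points for Times Square and JFK, picked here as an example:

```r
# Great-circle distance in meters on a spherical earth (haversine formula)
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

# Roughly Times Square -> JFK: about 21-22 km as the crow flies
haversine_m(-73.9855, 40.7580, -73.7781, 40.6413)
```

The actual driven distance is of course longer; these straight-line values are a lower bound, which is all we need to flag implausible trips.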

Let’s check the trip distances:

summary(cabs$distance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    1233    2095    3423    3873   45160
cabs %>%
  ggplot(aes(distance)) +
  geom_histogram(fill = "blue", bins = 200) + xlab('distance in meters') +
  scale_x_log10() +
  scale_y_sqrt()

We can see that 50% of the trips covered less than ~2.1 km - probably most of them are within Manhattan. 75% of them covered less than ~3.9 km. We also see 2 peaks above 10 km, probably trips from or to JFK airport.

We will check the extreme values.

Very short distance trips investigation

We see strange trips with very short distances (less than 100 m, for instance).

cabs %>% filter(distance < 100) %>% head(10) # trips less than 100 meters
## # A tibble: 10 x 12
##    id    vendor_id pickup_datetime     dropoff_datetime    passenger_count
##    <chr> <fct>     <dttm>              <dttm>              <fct>          
##  1 id34~ 2         2016-02-29 18:39:12 2016-02-29 18:42:59 1              
##  2 id01~ 2         2016-05-25 06:40:04 2016-05-25 06:43:13 1              
##  3 id09~ 2         2016-05-10 18:07:52 2016-05-10 18:26:21 2              
##  4 id29~ 1         2016-06-27 11:22:37 2016-06-27 11:23:17 1              
##  5 id07~ 1         2016-05-20 14:04:03 2016-05-20 14:33:41 1              
##  6 id11~ 2         2016-05-16 23:15:13 2016-05-16 23:31:00 6              
##  7 id35~ 2         2016-05-05 02:49:46 2016-05-05 02:50:53 1              
##  8 id01~ 2         2016-02-13 17:24:28 2016-02-13 17:56:13 1              
##  9 id34~ 1         2016-01-25 19:45:12 2016-01-25 19:54:52 1              
## 10 id32~ 2         2016-05-21 12:46:46 2016-05-21 12:46:49 2              
## # ... with 7 more variables: pickup_longitude <dbl>,
## #   pickup_latitude <dbl>, dropoff_longitude <dbl>,
## #   dropoff_latitude <dbl>, store_and_fwd_flag <chr>, trip_duration <dbl>,
## #   distance <dbl>

It’s possible to ride a taxi for just a few hundred meters. However, some durations associated with short-distance trips are strangely high: 33 minutes (~2000 s) for 44 meters is hardly possible (id0924324). Maybe there was a lot of traffic, but then you use Citymapper and take the underground! Even worse for zero-distance trips such as id0131920, which lasted 15+ minutes (~1100 s).

Thus, we will assume that a zero-distance ride is plausible only when it lasted less than 3 minutes (consistent with an immediate cancellation), and drop the zero-distance trips that lasted longer.

cabs = cabs %>% filter(distance > 0 | (distance == 0 & trip_duration < 180)) # drop zero-distance trips lasting 3 min or more

Long-distance trips investigation

Let’s now check very long distance trips.

cabs %>% arrange(desc(distance)) %>% head(10)
## # A tibble: 10 x 12
##    id    vendor_id pickup_datetime     dropoff_datetime    passenger_count
##    <chr> <fct>     <dttm>              <dttm>              <fct>          
##  1 id38~ 1         2016-06-19 23:30:12 2016-06-20 00:36:09 4              
##  2 id03~ 2         2016-05-02 13:55:18 2016-05-02 15:13:31 1              
##  3 id24~ 2         2016-02-01 17:18:00 2016-02-01 18:49:35 1              
##  4 id34~ 1         2016-04-04 14:31:15 2016-04-04 15:42:17 1              
##  5 id25~ 2         2016-03-07 18:39:16 2016-03-07 19:45:15 1              
##  6 id39~ 1         2016-03-19 12:29:17 2016-03-19 13:21:44 2              
##  7 id19~ 2         2016-03-16 14:54:30 2016-03-16 17:10:43 1              
##  8 id16~ 2         2016-06-13 18:28:05 2016-06-13 19:30:23 1              
##  9 id37~ 1         2016-05-24 14:23:54 2016-05-24 15:59:27 1              
## 10 id25~ 2         2016-01-26 15:27:55 2016-01-26 16:33:09 1              
## # ... with 7 more variables: pickup_longitude <dbl>,
## #   pickup_latitude <dbl>, dropoff_longitude <dbl>,
## #   dropoff_latitude <dbl>, store_and_fwd_flag <chr>, trip_duration <dbl>,
## #   distance <dbl>

We can see that the longest distance is 45+ km, which is totally feasible in a taxi.

Let’s visualise the ones with a distance above 10 km.

long_distance_trips <- cabs %>% filter(distance > 10*1000)
long_distance_trips <- sample_n(long_distance_trips, 2000) # sample for visualisation and loading purpose

long_distance_map = leaflet(data = long_distance_trips) %>% addProviderTiles(providers$CartoDB.Positron) %>%
  addCircleMarkers(~ pickup_longitude, ~pickup_latitude, radius = 1,
                   color = "green", fillOpacity = 0.3)%>%
  addCircleMarkers(~ dropoff_longitude, ~dropoff_latitude, radius = 1,
                   color = "orange", fillOpacity = 0.3)
for(i in 1:nrow(long_distance_trips)){
    long_distance_map <- addPolylines(long_distance_map, lat = as.numeric(long_distance_trips[i, c(7, 9)]), 
                               lng = as.numeric(long_distance_trips[i, c(6, 8)]),
                            color = "blue", weight = 2,opacity = 0.1)
} # adding the connection between each pickup and dropoff
long_distance_map

We can see that the vast majority are trips that started from an airport, with dropoffs all over the NYC area. These are very likely real trips, so we will keep them.

I.4.d Passenger count

In the summary we noticed that some trips had 0 passengers, and others up to 9. As we want to focus on popular trips, we remove the ones with 0 passengers.

cabs = cabs %>% filter(passenger_count != 0)
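Before mapping the big groups, a quick tabulation shows how rides spread across passenger counts. A minimal sketch on toy data (on the real data, the same `count()` call on `cabs` applies):

```r
library(dplyr)

# Toy sample of passenger counts, stored as a factor like in the cleaned data
toy <- tibble(passenger_count = factor(c(1, 1, 1, 2, 2, 5, 6)))

# Tally rides per passenger count, most frequent first
toy %>% count(passenger_count, sort = TRUE)
```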

Let’s now visualize those with at least 5 passengers. I thought yellow cabs had only 4 seats! Maybe they launched yellow vans, which would make it interesting to investigate opportunities for launching a kind of ridesharing service ;)

van_rides = cabs %>% filter(passenger_count %in% c(5,6,7,8,9)) # %in% instead of == to avoid element-wise recycling
van_rides <- sample_n(van_rides,2000) # sample of around 10% for visualisation and loading purpose

van_map = leaflet(data = van_rides) %>% addProviderTiles(providers$CartoDB.Positron) %>%
  addCircleMarkers(~ pickup_longitude, ~pickup_latitude, radius = 1,
                   color = "green", fillOpacity = 0.3)%>%
  addCircleMarkers(~ dropoff_longitude, ~dropoff_latitude, radius = 1,
                   color = "orange", fillOpacity = 0.3)
for(i in 1:nrow(van_rides)){
    van_map <- addPolylines(van_map, lat = as.numeric(van_rides[i, c(7, 9)]), 
                               lng = as.numeric(van_rides[i, c(6, 8)]),
                            color = "blue", weight = 2,opacity = 0.1)
} # adding the connection between each pickup and dropoff
van_map

The same kind of trips as the long-distance ones. We will keep them as they are probably real: taxi vans do operate in NYC.

End of cleaning

summary(cabs)
##       id            vendor_id  pickup_datetime              
##  Length:1453516     1:676876   Min.   :2016-01-01 00:00:17  
##  Class :character   2:776640   1st Qu.:2016-02-17 17:00:16  
##  Mode  :character              Median :2016-04-01 17:18:23  
##                                Mean   :2016-04-01 10:11:40  
##                                3rd Qu.:2016-05-15 03:44:10  
##                                Max.   :2016-06-30 23:59:39  
##                                                             
##  dropoff_datetime              passenger_count   pickup_longitude
##  Min.   :2016-01-01 00:03:31   1      :1030087   Min.   :-74.26  
##  1st Qu.:2016-02-17 17:17:29   2      : 209569   1st Qu.:-73.99  
##  Median :2016-04-01 17:34:01   5      :  77812   Median :-73.98  
##  Mean   :2016-04-01 10:27:39   3      :  59702   Mean   :-73.97  
##  3rd Qu.:2016-05-15 03:59:01   6      :  48064   3rd Qu.:-73.97  
##  Max.   :2016-07-01 23:02:03   4      :  28279   Max.   :-73.70  
##                                (Other):      3                   
##  pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag
##  Min.   :40.51   Min.   :-74.26    Min.   :40.48    Length:1453516    
##  1st Qu.:40.74   1st Qu.:-73.99    1st Qu.:40.74    Class :character  
##  Median :40.75   Median :-73.98    Median :40.75    Mode  :character  
##  Mean   :40.75   Mean   :-73.97    Mean   :40.75                      
##  3rd Qu.:40.77   3rd Qu.:-73.96    3rd Qu.:40.77                      
##  Max.   :40.91   Max.   :-73.70    Max.   :40.92                      
##                                                                       
##  trip_duration        distance    
##  Min.   :    1.0   Min.   :    0  
##  1st Qu.:  397.0   1st Qu.: 1239  
##  Median :  662.0   Median : 2101  
##  Mean   :  836.1   Mean   : 3432  
##  3rd Qu.: 1071.0   3rd Qu.: 3882  
##  Max.   :10731.0   Max.   :45160  
## 

After cleaning, we have kept 1,453,516 rides, which represents over 99% of the data. The outliers were not that numerous, luckily for us!

III. Adding weather data to explain outliers

III.a Weather Data inspection

weather <- read_csv("weather_data_nyc_centralpark_2016.csv") # read the data

# reformatting the date, converting Fahrenheit to Celsius, and parsing precipitation and snow as numeric
weather <- weather %>%
  mutate(date = as.Date(weather$date,format="%d-%m-%y"),
         `maximum temperature` = fahrenheit.to.celsius(`maximum temperature`),
         `minimum temperature` = fahrenheit.to.celsius(`minimum temperature`),
         `average temperature` = fahrenheit.to.celsius(`average temperature`),
         rain = as.numeric(ifelse(precipitation == "T", "0.01", precipitation)), # use 0.01 for a trace; the minimum recorded value is 0.1, so a trace has to be below it
         snow_fall = as.numeric(ifelse(`snow fall` == "T", "0.01", `snow fall`)),
         snow_depth = as.numeric(ifelse(`snow depth` == "T", "0.01", `snow depth`)))
weather = weather %>% filter(date < '2016-07-01') # keep only the dates covered by the cabs data
head(weather)
## # A tibble: 6 x 10
##   date       `maximum temper~ `minimum temper~ `average temper~
##   <date>                <dbl>            <dbl>            <dbl>
## 1 2016-01-01             6.67             1.11             3.89
## 2 2016-01-02             4.44            -1.11             1.67
## 3 2016-01-03             7.78             0.56             4.17
## 4 2016-01-04             1.67           -10.6             -4.44
## 5 2016-01-05            -1.67           -12.2             -6.94
## 6 2016-01-06             5               -8.89            -1.94
## # ... with 6 more variables: precipitation <chr>, `snow fall` <chr>, `snow
## #   depth` <chr>, rain <dbl>, snow_fall <dbl>, snow_depth <dbl>
summary(weather)
##       date            maximum temperature minimum temperature
##  Min.   :2016-01-01   Min.   :-10.00      Min.   :-21.110    
##  1st Qu.:2016-02-15   1st Qu.:  6.67      1st Qu.: -1.110    
##  Median :2016-03-31   Median : 13.33      Median :  5.000    
##  Mean   :2016-03-31   Mean   : 13.89      Mean   :  4.792    
##  3rd Qu.:2016-05-15   3rd Qu.: 20.42      3rd Qu.: 11.110    
##  Max.   :2016-06-30   Max.   : 31.67      Max.   : 21.110    
##  average temperature precipitation       snow fall        
##  Min.   :-15.560     Length:182         Length:182        
##  1st Qu.:  2.570     Class :character   Class :character  
##  Median :  9.440     Mode  :character   Mode  :character  
##  Mean   :  9.339                                          
##  3rd Qu.: 15.762                                          
##  Max.   : 25.830                                          
##   snow depth             rain           snow_fall        snow_depth     
##  Length:182         Min.   :0.00000   Min.   :0.0000   Min.   : 0.0000  
##  Class :character   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.: 0.0000  
##  Mode  :character   Median :0.00000   Median :0.0000   Median : 0.0000  
##                     Mean   :0.09352   Mean   :0.1466   Mean   : 0.4619  
##                     3rd Qu.:0.04000   3rd Qu.:0.0000   3rd Qu.: 0.0000  
##                     Max.   :1.24000   Max.   :9.5000   Max.   :10.0000
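The same "T"-to-0.01 substitution is written out three times in the mutate above; it can be factored into one small helper (a sketch; `trace_to_num` is a name introduced here, not from the original code):

```r
# Map the "T" (trace) marker to a small numeric value, otherwise parse as numeric
trace_to_num <- function(x, trace = "0.01") {
  as.numeric(ifelse(x == "T", trace, x))
}

trace_to_num(c("0.5", "T", "0"))  # 0.50 0.01 0.00
```

The three conversions would then collapse to calls like `rain = trace_to_num(precipitation)`.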

III.b Explain outliers with weather data

In the inspection part, we saw a drop in rides at the end of January and at the end of May. Let’s see if we can explain them with the weather data.

Using the cabs dataset, we will calculate the total number of rides, passengers, kilometres driven and hours driven for each day. Then we will merge this with the weather conditions to look for correlations.

cabs_weather = cabs %>% mutate(date = date(pickup_datetime)) %>% group_by(date) %>%
  summarise(tot_rides=n(),tot_passenger=sum(as.numeric(passenger_count)),
            tot_kms=(sum(distance)/1000),
            tot_hours_duration =(sum(trip_duration)/3600)) # convert totals to readable units (km, hours)
cabs_weather = left_join(cabs_weather,weather,by="date") # merging on date
head(cabs_weather)
## # A tibble: 6 x 14
##   date       tot_rides tot_passenger tot_kms tot_hours_durat~
##   <date>         <int>         <dbl>   <dbl>            <dbl>
## 1 2016-01-01      7141         19994  27249.            1423.
## 2 2016-01-02      6491         18114  22943.            1286.
## 3 2016-01-03      6327         17323  25053.            1259.
## 4 2016-01-04      6693         17705  23521.            1335.
## 5 2016-01-05      7173         18959  23646.            1472.
## 6 2016-01-06      7341         19465  23680.            1527.
## # ... with 9 more variables: `maximum temperature` <dbl>, `minimum
## #   temperature` <dbl>, `average temperature` <dbl>, precipitation <chr>,
## #   `snow fall` <chr>, `snow depth` <chr>, rain <dbl>, snow_fall <dbl>,
## #   snow_depth <dbl>

Let’s visualise the number of rides alongside the weather conditions.

ggplot(cabs_weather, aes(date)) + 
  geom_line(aes(y = snow_fall, color = "snow fall"),size=1) + 
  geom_line(aes(y = snow_depth, color = "snow depth"),size=1) +
  labs(x = "Date", y = "Inches") + scale_x_date(limits = ymd(c("2016-01-01", "2016-06-30")))

ggplot(cabs_weather, aes(date)) + 
  geom_line(aes(y = `average temperature`, color = "average temperature"),size=1) + 
  geom_line(aes(y = `maximum temperature`, color = "maximum temperature"),size=1) +
  geom_line(aes(y = `minimum temperature`, color = "minimum temperature"),size=1) +
  labs(x = "Date", y = "Temperature in Celsius") + scale_x_date(limits = ymd(c("2016-01-20", "2016-02-08")))
## Warning: Removed 162 rows containing missing values (geom_path).

## Warning: Removed 162 rows containing missing values (geom_path).

## Warning: Removed 162 rows containing missing values (geom_path).

ggplot(cabs_weather, aes(date)) + 
  geom_line(aes(y = tot_rides),size=1, color = "blue") +
  labs(x = "Date", y = "Number of rides") + scale_x_date(limits = ymd(c("2016-01-01", "2016-06-30")))

The drop in trip volume matches the biggest snowfall of the period in NYC, on January 23rd. A quick search shows there was a huge blizzard in NYC that day, with record-breaking snowfall.

The snowfall on the 5th of February did not impact the rides as much. The average temperature was around 0 degrees with a maximum up to 6 degrees, so the snow might have melted quickly! (On the 23rd of January, even the maximum temperature was below 0 degrees.)

Regarding the second largest drop, it happened on the 30th of May, which was Memorial Day in the USA. It was a Monday and most people were not working, which can explain the drop.
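These drops can also be surfaced programmatically by ranking days by volume rather than reading them off the chart. A minimal sketch on toy data (on the real data, `cabs_weather %>% arrange(tot_rides) %>% head()` does the same):

```r
library(dplyr)

# Toy daily totals with an obvious dip on the fourth day
toy <- tibble(
  date      = as.Date("2016-01-20") + 0:4,
  tot_rides = c(8000, 8200, 8100, 2300, 7900)
)

# Lowest-volume days (the outlier candidates) come out on top
toy %>% arrange(tot_rides) %>% head(2)
```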

Thanks a lot for reading - I would be happy to present my results in more detail!

Eddy OHAYON